class: center, middle, inverse, title-slide

# Lecture 3
## Summary Statistics
### Psych 10 C
### University of California, Irvine
### 04/01/2022

---

## Summary Statistics

- Another common way to summarize information from experimental or survey data is by using a statistic.

--

- Statistics are **functions** of the **random variables** in an experiment that convey information about where our observations are located and how much they vary.

--

- Not all the variables in an experiment are equally important, so we don't always visualize every one of them, but we still want to make sure that we gather all the information we can about our sample.

--

- In those cases we can use summary statistics to report some of their properties.

---

class: inverse, center, middle

# Random Variables and Functions

---

## Random variables

- Statisticians are very bad at naming things...

--

- When you think of the words "random" and "variable", what comes to mind first?

--

- The formal definition is the opposite!

--

- **Definition:** A random variable is a function of the outcomes of an experiment.

---

# Functions

- Functions have formal definitions, but what really matters is that we remember how they "work".

--

- Intuitively, we can think of functions as rules that describe how two groups of "things" are associated.

--

- For example, imagine we toss a coin and record whether it lands heads or tails. A function could be the rule:
  - `\(x = 0\)` if the outcome is tails and `\(x = 1\)` if the outcome is heads.

--

- This simple rule lets us assign a number to a variable `\(x\)` depending on the result of the coin toss. In other words, `\(x\)` is defined as a function of the outcome of the experiment.

---

## Functions

- Another simple function would be `\(y = x + 1\)`.

- This function tells us that, whatever the value of `\(x\)` is, we can get the value of `\(y\)` by adding `\(1\)` to `\(x\)`.
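--

- Both rules can be written directly in R (a small sketch; the function names below are made up for illustration):

```r
# The coin-toss rule: map an outcome ("heads"/"tails") to a number
x_of <- function(outcome) ifelse(outcome == "heads", 1, 0)
x_of("heads")  # 1
x_of("tails")  # 0

# The rule y = x + 1
y_of <- function(x) x + 1
y_of(0)  # 1
```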
--

- Regardless of how complex they look, we can always think of functions as a "map" that specifies how to go from values of one variable to values of a second variable.

---

## Back to Random Variables

- Random variables are neither random nor variables; they are simply the rules we use to assign numbers to the outcomes of an experiment.

--

- In other words, random variables are deterministic functions (See? Statisticians really are bad at naming things!).

--

- In our previous example with the coin toss, `\(x\)` can be considered a random variable, as there's a rule that assigns a numeric value (0 or 1) to the outcome of the experiment (heads or tails).

---

## Example: the memory experiment

- Let's think back to our memory experiment. Each time we present one of the 50 words included in the original list, participants can respond that the word was on the original list or not.

--

- We treat our participants' responses as probabilistic, and we can create a random variable that says:

--

  - If the word was on the original list **and** the participant responded that the word was on the original list, then `\(x = 1\)`.

--

  - If the word was on the original list **and** the participant responded that the word was **not** on the original list, then `\(x = 0\)`.

--

- If we get the value of `\(x\)` corresponding to every trial for any given participant, we would have 50 different values that indicate whether each word was recognized correctly or missed.

---

## Statistics

- The examples we have talked about are all statistics. A statistic is just a function of our sample (data).

--

- In our memory experiment, we don't have a record of the response each participant gave to each word; we have something "simpler".

--

- We have a random variable that sums all the correct responses.

--

- We have lost some information in doing so. Can you guess what information has been lost?
--

- We traded off information about the order in which the correct responses occurred in exchange for being able to summarize the whole experiment with a single number: the total number of correct responses.

---

## Statistics

- Every time we use a statistic (i.e., a function of our experimental outcomes) we either:

--

  - Keep the same information (for example, when we assigned a value of 1 to every heads outcome in a series of coin tosses), or

--

  - Lose information (for example, when we calculated the total number of correct responses observed in the memory experiment).

--

- This loss will not be a problem in the majority of the examples we'll cover in this course; however, it is important to keep it in mind.

---

class: inverse, center, middle

# Commonly Used Statistics
## The Mean

---

## Mean

- As you know from Psych 10B, one of the properties of any random variable that we are interested in is its expected value.

--

- Expected values can be computed from the formula:

`$$\mathbb{E}(x) = \sum_x x \ p(x)$$`

--

- We now face a problem: whenever we gather data from an experiment, we don't know the probability of each value observed in our random variable.

--

- In other words, we don't know `\(p(x)\)`.

--

- For example, what is the probability that a participant gives exactly 40 correct responses?

---

## Mean

- Fortunately, we can prove mathematically that the **average** of any random variable will tend to be close to its expected value, and it gets closer as the number of observations grows.

--

- This is true regardless of the value of `\(p(x)\)`.

--

- Of course, this is just an approximation, and there will be some discrepancy between the average and the true expected value.

--

- But it will always be the best guess available!
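--

- We can see this with a quick simulation (a sketch; the sample size is made up):

```r
set.seed(10)

# A fair coin: E(x) = 0 * 0.5 + 1 * 0.5 = 0.5
tosses <- sample(c(0, 1), size = 10000, replace = TRUE)

mean(tosses)  # close to 0.5, but not exactly 0.5
```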
--

- Calculating the average is simple:

`$$\bar{x} = \frac{\sum_{i = 1}^n x_i}{n}$$`

--

- Here, `\(x_i\)` corresponds to each individual observation (remember, in the memory experiment we now have one observation per participant: the number of correctly recognized words), and `\(n\)` indicates the total number of observations (participants).

---

# Example: Average age of our participants

- Let's go back to the memory example and look at the mean age of our participants.

--

```r
mean_age <- memory %>%
  summarise("bar_age" = mean(age)) %>%
  pull(bar_age)
```

--

- We can now take a look at the mean age of our participants by typing the name of the variable in the console:

```r
mean_age
```

```
[1] 36.68
```

---

# Note:

- On Homework 1, you are expected to print average values in-text (_"inline chunks"_). An easy way to do this is to use the following code in the text (note that there is no space between the backtick and `r`):

--

```
`r name-of-variable`
```

--

- This will print the value of `name-of-variable` directly to the PDF, as in: "The mean age of the participants in the experiment was 36.68".

---

# Mean by group

- Using the `group_by()` function of the `tidyverse` package, we can group our observations with respect to a specific variable. That way, we can compute group-specific statistics to summarize our data.

--

- In our memory example, if we want to calculate the mean number of correct responses registered on each test (test-1 vs. test-2) we can use the following code:

```r
mean_test <- memory %>%
  group_by(test_id) %>%
  summarise("mean" = mean(correct))
```

--

- We now exclude the `pull()` function because we need to keep the value for each test.
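---

# Mean by group: a toy example

- To see what `group_by()` does, here is a sketch on a small made-up data set (the numbers below are invented for illustration; they are not the `memory` data):

```r
library(dplyr)

# Two participants, each observed on two tests (made-up scores)
toy <- tibble(
  test_id = c("test_1", "test_1", "test_2", "test_2"),
  correct = c(45, 44, 38, 36)
)

toy %>%
  group_by(test_id) %>%
  summarise("mean" = mean(correct))
# one row per test: 44.5 for test_1, 37 for test_2
```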
---

# Mean by group

- We can now look at the result by typing the name of our new variable into the console:

```r
mean_test
```

```
# A tibble: 2 × 2
  test_id  mean
  <chr>   <dbl>
1 test_1   44.9
2 test_2   37.1
```

--

- We can also look at the results, one at a time:

.pull-left[
- test 1:

```r
mean_test$mean[1]
```

```
[1] 44.94
```
]

.pull-right[
- test 2:

```r
mean_test$mean[2]
```

```
[1] 37.12
```
]

---

class: inverse, center, middle

# Commonly Used Statistics
## Sample Variance

---

# Variance

- Another important property of random variables is the variance.

--

- **Definition:** The variance is the expected squared distance between the random variable and its expected value.

--

- In other words, the variance is an expectation that is itself computed using another expectation:

--

`$$\mathbb{V}ar(x) = \mathbb{E}[(x - \mathbb{E}(x))^2]$$`

--

- Recall the definition of an expected value:

--

`$$\mathbb{E}(x) = \sum_x x \ p(x)$$`

--

- Applying it to the variance, we can rewrite it as:

`$$\mathbb{E}[(x - \mathbb{E}(x))^2] = \sum_x (x - \mathbb{E}(x))^2\ p(x)$$`

--

- Once again, based on what we know about expected values, it's reasonable to ask whether we can find a good approximation to the variance.

---

# Sample variance

- Let's go back to the formal definition of the variance:

`$$\mathbb{E}[(x - \mathbb{E}(x))^2] = \sum_x (x - \mathbb{E}(x))^2\ p(x)$$`

--

- There are two things we're clearly missing: the probability of the outcomes `\(p(x)\)` and the expected value of `\(x\)` ( `\(\mathbb{E}(x)\)` ).

--

- But we already established that we have a good approximation to the expected value: the mean!

--

- Therefore, to approximate the variance we can use the average **squared** distance between each of our observations and the mean.
--

- This is called the sample variance, and it is a relatively good approximation to the real variance:

`$$s^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n}$$`

---

# Note:

- In other classes, you might encounter the sample variance defined with an `\(n-1\)` in the denominator:

`$$s^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n-1}$$`

--

- This is indeed the more appropriate approximation and the one that R uses by default (therefore, the one you need to use on Homework 1).

--

- However, for the sake of simplicity, in this class we will use the definition with `\(n\)` in the denominator.

---

## Sample variance

- We can get the sample variance using code similar to what we used for the mean. For example, the variance of the age of our participants can be computed as:

```r
var_age <- memory %>%
  group_by(test_id) %>%
  summarise("var_age" = var(age)) %>%
  pull(var_age)
```

--

- In this case we need to use the **`group_by()`** function because each participant has been entered into the data twice (once per test); grouping by test ensures each age is counted only once per group.

--

- Remember that the variance is in squared units! Its interpretation isn't straightforward.

--

- The variance in participants' ages was 213.3511111 (both tests include the same participants, so both groups give the same value).

---

# Sample variance by group

- We can calculate the variance of the number of correct responses per test in our memory experiment.

--

- The code will be almost the same as before:

```r
var_test <- memory %>%
  group_by(test_id) %>%
  summarise("variance" = var(correct))
```

--

- And once again, we can look at each result, one at a time:

.pull-left[
- test 1:

```r
var_test$variance[1]
```

```
[1] 4.541818
```
]

.pull-right[
- test 2:

```r
var_test$variance[2]
```

```
[1] 29.09657
```
]
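---

# Note: `var()` vs. the class definition

- Remember that R's `var()` divides by `\(n - 1\)`, while the definition we use in class divides by `\(n\)`. A small sketch with made-up observations (not the `memory` data) shows how the two are related:

```r
# Made-up observations for illustration
x <- c(44, 45, 38, 36, 42)
n <- length(x)

# R's default: divides by n - 1
var(x)  # 15

# The class definition: divides by n
sum((x - mean(x))^2) / n  # 12

# The two differ by a factor of (n - 1) / n
var(x) * (n - 1) / n  # 12, same as the class definition
```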